9 research outputs found

    Efficient design space exploration of embedded microprocessors

    Mechanistic analytical modeling of superscalar in-order processor performance

    Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to get performance estimates or insight into the application-microarchitecture interaction without running slow, detailed cycle-level simulations, because performance highly depends on the order of instructions within the application’s dynamic instruction stream, as in-order processors stall on inter-instruction dependences and functional-unit contention. To limit the number of detailed cycle-level simulations needed during design space exploration, we propose a mechanistic analytical performance model that is built from understanding the internal mechanisms of the processor. The mechanistic performance model for superscalar in-order processors is shown to be accurate, with an average performance prediction error of 3.2% compared to detailed cycle-accurate simulation using gem5. We also validate the model against hardware, using the ARM Cortex-A8 processor, and show that it is accurate within 10% on average. We further demonstrate the usefulness of the model through three case studies: (1) design space exploration, identifying the optimum number of functional units for achieving a given performance target; (2) program-machine interactions, providing insight into microarchitecture bottlenecks; and (3) compiler-architecture interactions, visualizing the impact of compiler optimizations on performance.
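
    The abstract does not spell out the model itself. As a rough illustration, mechanistic (interval-style) models express total execution cycles as a base term plus additive penalty terms, along the lines of the sketch below; all symbols are assumptions for illustration, not the paper's notation:

        C \;\approx\; \frac{N}{D}
            \;+\; C_{\mathrm{dep}}
            \;+\; \sum_{\ell \in \{\mathrm{I\$},\,\mathrm{D\$},\,\mathrm{TLB}\}} m_{\ell}\, p_{\ell}
            \;+\; m_{\mathrm{br}}\, p_{\mathrm{br}}

    where N is the dynamic instruction count, D the issue width, C_dep the cycles lost to inter-instruction dependences and functional-unit contention, m_ℓ and p_ℓ the miss counts and penalties of the memory structures, and m_br and p_br the number of mispredicted branches and the misprediction penalty.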

    MLPerf Inference Benchmark

    Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
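
    The benchmark's actual harness and rules are not reproduced in the abstract. The sketch below only illustrates the kind of architecture-neutral measurement such a benchmark standardizes: timing an opaque inference callable over a query stream and reporting throughput and tail latency. The run_query and queries names are hypothetical placeholders; this is illustrative Python, not the MLPerf LoadGen API.

        import time
        import statistics

        def measure_inference(run_query, queries, warmup=10):
            # Warm up the system under test before timing (caches, JIT, clock scaling).
            for q in queries[:warmup]:
                run_query(q)
            latencies = []
            start = time.perf_counter()
            for q in queries:
                t0 = time.perf_counter()
                run_query(q)                      # opaque inference call on one query
                latencies.append(time.perf_counter() - t0)
            wall = time.perf_counter() - start
            latencies.sort()
            return {
                "throughput_qps": len(queries) / wall,
                "p50_ms": 1e3 * statistics.median(latencies),
                "p99_ms": 1e3 * latencies[int(0.99 * (len(latencies) - 1))],
            }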

    Selecting representative benchmark inputs for exploring microprocessor design spaces

    The design process of a microprocessor requires representative workloads to steer the search process toward an optimum design point for the target application domain. However, considering a broad set of workloads to cover the large space of potential workloads is infeasible given how time-consuming design space exploration typically is. Hence, it is crucial to select a small yet representative set of workloads, which leads to a shorter design cycle while yielding a (near) optimal design. Prior work has mostly looked into selecting representative benchmarks; however, limited attention was given to the selection of benchmark inputs and how this affects workload representativeness during design space exploration. Using a set of 1,000 inputs for a number of embedded benchmarks and a design space with around 1,700 design points, we find that selecting a single random input or three random inputs per benchmark can, in the worst case, lead to a suboptimal design that is 56% and 33% off, respectively, on average, relative to the optimal design in our design space in terms of Energy-Delay Product (EDP). We then propose and evaluate a number of methods for selecting representative inputs and show that we can find the optimum design point with as few as three inputs.
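
    A minimal sketch of how such EDP-loss numbers can be computed, assuming a hypothetical table edp[(design, input)] of per-design, per-input Energy-Delay Products; the layout, names, and random sampling are illustrative assumptions, not the paper's data or method:

        import random

        def edp_loss_from_sampling(edp, designs, inputs, k=3, trials=1000, seed=0):
            # True ranking: aggregate EDP of every design over all inputs.
            total = {d: sum(edp[(d, i)] for i in inputs) for d in designs}
            optimum = min(total, key=total.get)
            rng = random.Random(seed)
            losses = []
            for _ in range(trials):
                sample = rng.sample(inputs, k)    # pretend only k inputs were simulated
                picked = min(designs, key=lambda d: sum(edp[(d, i)] for i in sample))
                losses.append(100.0 * (total[picked] - total[optimum]) / total[optimum])
            # Worst-case and average EDP loss (%) of designing with k random inputs.
            return max(losses), sum(losses) / len(losses)

    Replacing the random sample with a deliberately chosen representative subset amounts to swapping the selection step inside the loop.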

    A Mechanistic Performance Model for Superscalar In-Order Processors

    Mechanistic processor performance modeling builds an analytical model from understanding the underlying mechanisms in the processor and provides fundamental insight into program-microarchitecture interactions, as well as microarchitecture structure scaling trends and interactions. Whereas prior work in mechanistic performance modeling focused on superscalar out-of-order processors, this paper presents a mechanistic performance model for superscalar in-order processors. We find mechanistic modeling for in-order processors to be more challenging compared to out-of-order processors because the latter are designed to hide latencies, and hence, from a modeling perspective, detailed modeling of instruction execution latencies and dependencies is not required. The proposed mechanistic performance model for superscalar in-order processors models the impact of non-unit instruction execution latencies, inter-instruction dependencies, cache/TLB misses and branch mispredictions, and achieves an average performance prediction error of 2.5% compared to detailed cycle-accurate simulation. We extensively evaluate the model’s accuracy and we demonstrate its usefulness through three applications: (i) we compare in-order versus out-of-order performance, (ii) we quantify the impact of compiler optimizations on in-order performance, and (iii) we perform a power/performance design space exploration.
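
    As with the related entry above, the abstract does not reproduce the model. The sketch below only illustrates why an additive, interval-style cycle model enables fast design space exploration: per-program event counts are gathered once, after which each design point costs a handful of arithmetic operations. The profile numbers and penalties are invented for illustration and are not the paper's formulation.

        def predicted_cycles(profile, issue_width, penalties):
            # Additive, interval-style cycle estimate (a sketch, not the paper's exact model).
            base = profile["instructions"] / issue_width         # ideal issue cycles
            stalls = profile["dependency_stall_cycles"]          # dependences + FU contention
            misses = sum(profile[event] * cost for event, cost in penalties.items())
            return base + stalls + misses

        profile = {                                # hypothetical per-program event counts
            "instructions": 120e6,
            "dependency_stall_cycles": 18e6,
            "l1d_misses": 1.5e6,
            "l2_misses": 0.2e6,
            "branch_mispredictions": 0.9e6,
        }
        penalties = {"l1d_misses": 10, "l2_misses": 80, "branch_mispredictions": 8}

        for width in (1, 2, 4):                    # sweep issue width without re-simulating
            print(width, f"{predicted_cycles(profile, width, penalties):.3e}")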

    How sensitive is processor customization to the workload's input datasets?

    Hardware customization is an effective approach for meeting application performance requirements while achieving high levels of energy efficiency. Application-specific processors achieve high performance at low energy by tailoring their designs towards a specific workload, i.e., an application or application domain of interest. A fundamental question that has so far remained unanswered, however, is to what extent processor customization is sensitive to the training workload's input datasets. Current practice is to consider a single or only a few input datasets per workload during the processor design cycle, the reason being that simulation is prohibitively time-consuming, which precludes considering a large number of datasets. This paper addresses this fundamental question for the first time. In order to perform the large number of runs required to address this question in a reasonable amount of time, we first propose a mechanistic analytical model, built from first principles, that is accurate within 3.6% on average across a broad design space. The analytical model is at least 4 orders of magnitude faster than detailed cycle-accurate simulation for design space exploration. Using the model, we are able to study the sensitivity of the optimum customized processor architecture to a workload's input dataset. Considering MiBench benchmarks and 1,000 datasets per benchmark, we conclude that processor customization is largely dataset-insensitive. This has an important implication in practice: a single or only a few datasets are sufficient for determining the optimum processor architecture when designing application-specific processors.
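
    A minimal sketch of the sensitivity check itself, assuming a hypothetical table edp[(design, dataset)] of model-predicted Energy-Delay Products (the layout and names are illustrative, not the paper's): pick the design that is best over all datasets, then measure how much worse it is than each dataset's own best design.

        def dataset_sensitivity(edp, designs, datasets):
            # Design chosen when all datasets are taken into account.
            total = {d: sum(edp[(d, s)] for s in datasets) for d in designs}
            chosen = min(total, key=total.get)
            losses = []
            for s in datasets:
                best_s = min(designs, key=lambda d: edp[(d, s)])   # per-dataset optimum
                losses.append(100.0 * (edp[(chosen, s)] - edp[(best_s, s)])
                              / edp[(best_s, s)])
            # Small worst-case loss suggests customization is largely dataset-insensitive.
            return max(losses), sum(losses) / len(losses)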